


04d212c4eeeb710f170d47f8d5b9b88a-Paper-Conference.pdf

Neural Information Processing Systems

A wide array of control applications, ranging from medical to engineering, fundamentally deals with critical systems, i.e., systems of vital importance where the control actions have to guarantee no harm to the system functionality. Examples include managing nuclear fusion [Degrave et al., 2022], performing robotic surgeries [Datta et al., 2021], and devising patient treatment strategies [Komorowski et al., 2018].



Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems

Neural Information Processing Systems

Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Regularization (RL-AR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes the safety constraints. RL-AR performs policy combination via a "focus module," which determines the appropriate combination depending on the state--relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-AR not only ensures safety during training but also achieves a return competitive with the standards of model-free RL that disregards safety.
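As a rough illustration of the policy combination described above, the sketch below blends an RL action with a safe regularizer action using a state-dependent weight. The names `combine_actions` and `focus_weight`, and the visit-count heuristic, are assumptions made for illustration. The abstract states only that a focus module chooses the combination per state, not how it is implemented.

```python
import numpy as np


def combine_actions(a_rl, a_reg, beta):
    """Convex combination of two actions.

    beta is the weight on the safe regularizer action (assumed to lie in [0, 1]);
    1 - beta is the weight on the RL action.
    """
    beta = float(np.clip(beta, 0.0, 1.0))
    a_rl = np.asarray(a_rl, dtype=float)
    a_reg = np.asarray(a_reg, dtype=float)
    return beta * a_reg + (1.0 - beta) * a_rl


def focus_weight(visit_count, scale=10.0):
    """Hypothetical stand-in for a focus module.

    Returns a weight near 1 for rarely visited states (lean on the regularizer)
    that decays toward 0 as the state is visited more often (lean on the RL policy).
    """
    return 1.0 / (1.0 + visit_count / scale)


if __name__ == "__main__":
    a_rl, a_reg = [1.0, -0.5], [0.2, 0.0]
    for visits in (0, 10, 1000):
        beta = focus_weight(visits)
        print(visits, round(beta, 3), combine_actions(a_rl, a_reg, beta))
```

With this stand-in, an unvisited state uses the regularizer action alone, while a heavily visited state is controlled almost entirely by the RL policy.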


Thinking agents for zero-shot generalization to qualitatively novel tasks

arXiv.org Artificial Intelligence

The Obelisk Team, Astera Institute, Emeryville, USA. Correspondence: Thomas Miconi, thomas.miconi@gmail.com
Abstract: Intelligent organisms can solve truly novel problems which they have never encountered before, either in their lifetime or their evolution. An important component of this capacity is the ability to "think", that is, to mentally manipulate objects, concepts and behaviors in order to plan and evaluate possible solutions to novel problems, even without environment interaction. To generate problems that are truly qualitatively novel, while still solvable zero-shot (by mental simulation), we use the combinatorial nature of environments: we train the agent while withholding a specific combination of the environment's elements. The novel test task, based on this combination, is thus guaranteed to be truly novel, while still mentally simulable since the agent has been exposed to each individual element (and their pairwise interactions) during training. We propose a method to train agents endowed with world models to make use of their mental simulation abilities, by selecting tasks based on the difference between the agent's pre-thinking and post-thinking performance. When tested on the novel, withheld problem, the resulting agent successfully simulated alternative scenarios and used the resulting information to guide its behavior in the actual environment, solving the novel task in a single real-environment trial (zero-shot).
1 Introduction
An important aspect of intelligence is the ability to handle novel problems. While simpler organisms are restricted to problems similar to those they have been exposed to during training, and fare badly when faced with [...] A major component of this capacity is the ability to think before acting. By "thinking", that is, by internally manipulating concepts and behaviors and evaluating likely outcomes, agents can tackle novel problems never encountered before, by recombining existing knowledge into new solutions. This ability is perhaps the hallmark of what we think of as truly "intelligent" behavior: it is highly prevalent in humans, but it is debated whether it even exists in non-human animals [Suddendorf and Busby, 2003], including mammals such as rodents [Gillespie et al., 2021] or even great apes [Suddendorf et al., 2009, Osvath, 2010]. Much work in machine learning has focused on training agents with increasingly complex innate behaviors.


Reinforcement Learning with Adaptive Control Regularization for Safe Control of Critical Systems

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Control Regularization (RL-ACR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes safety constraints. We perform policy combination via a "focus network," which determines the appropriate combination depending on the state -- relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-ACR ensures safety during training while achieving the performance standards of model-free RL approaches that disregard safety.


Learning Control Policies for Variable Objectives from Offline Data

arXiv.org Artificial Intelligence

Offline reinforcement learning provides a viable approach to obtaining advanced control strategies for dynamical systems, in particular when direct interaction with the environment is not available. In this paper, we introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP). With this approach, policies are trained to generalize efficiently over a variety of objectives, which parameterize the reward function. We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime, without the need to collect additional observation batches or re-train.
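To make the idea of an objective-conditioned policy concrete, here is a toy sketch in which the objective is a pair of weights that parameterize the reward and is also passed to the policy as input. Everything in it is an assumption for illustration: the one-step dynamics `next_state = state + action`, the quadratic reward, and the closed-form `toy_policy` are not taken from the paper, which trains its policies with model-based policy search.

```python
def reward(state, action, objective):
    """Reward parameterized by the objective (hypothetical quadratic form).

    objective = (w_track, w_effort): weights on the squared next state
    and on the squared control effort.
    """
    w_track, w_effort = objective
    next_state = state + action  # assumed toy dynamics
    return -(w_track * next_state ** 2 + w_effort * action ** 2)


def toy_policy(state, objective):
    """Policy that takes the objective as an extra input.

    For this toy problem the reward-maximizing action has a closed form:
    a = -w_track * s / (w_track + w_effort), assuming w_track + w_effort > 0.
    """
    w_track, w_effort = objective
    return -w_track * state / (w_track + w_effort)


if __name__ == "__main__":
    state = 2.0
    # The same policy, re-balanced at runtime by changing only its objective input.
    for objective in [(1.0, 0.0), (1.0, 1.0), (1.0, 9.0)]:
        action = toy_policy(state, objective)
        print(objective, action, reward(state, action, objective))
```

Changing the objective from `(1.0, 0.0)` to `(1.0, 9.0)` makes the same policy act far more cautiously, with no new data and no re-training in this toy setting.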


Skill Decision Transformer

arXiv.org Artificial Intelligence

Recent work has shown that Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL) by representing the traditional RL problem as a sequence modelling problem (Chen et al., 2021; Janner et al., 2021). However, many of these methods only optimize for high returns, and may not extract much information from a diverse dataset of trajectories. Generalized Decision Transformers (GDTs) (Furuta et al., 2021) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data. Building upon this, we propose Skill Decision Transformer (Skill DT). Skill DT draws inspiration from hindsight relabelling (Andrychowicz et al., 2017) and skill discovery methods to discover a diverse set of primitive behaviors, or skills. We show that Skill DT can not only perform offline state-marginal matching (SMM), but can also discover descriptive behaviors that can be easily sampled. Furthermore, we show that through purely reward-free optimization, Skill DT is still competitive with supervised offline RL approaches on the D4RL benchmark.


To Explore or Not to Explore: Regret-Based LTL Planning in Partially-Known Environments

arXiv.org Artificial Intelligence

In this paper, we investigate the optimal robot path planning problem for high-level specifications described by co-safe linear temporal logic (LTL) formulae. We consider the scenario where the map geometry of the workspace is partially known. Specifically, we assume that there are some unknown regions whose successor regions the robot does not know a priori unless it physically reaches them. In contrast to the standard game-based approach that optimizes the worst-case cost, we propose to use regret as a new metric for planning in such a partially-known environment. The regret of a plan under a fixed but unknown environment is the difference between the actual cost incurred and the best-response cost the robot could have achieved had it known the actual environment in hindsight. We provide an effective algorithm for finding an optimal plan that satisfies the LTL specification while minimizing its regret. A case study on firefighting robots is provided to illustrate the proposed framework. We argue that the new metric is more suitable for the scenario of a partially-known environment since it captures the trade-off between the actual cost spent and the potential benefit one may obtain from exploring an unknown region.
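The regret definition above can be worked through on a small made-up cost table. The sketch below is only an illustration: the plan names, environment names, and costs are invented, and it assumes a plan is judged by its largest regret over the candidate environments. It does not reproduce the paper's LTL planning algorithm.

```python
def regret(costs, plan, env):
    """Actual cost of `plan` in `env` minus the best cost achievable in `env`."""
    best_response = min(costs[p][env] for p in costs)
    return costs[plan][env] - best_response


def max_regret(costs, plan):
    """Largest regret of `plan` over all candidate environments (assumed criterion)."""
    return max(regret(costs, plan, env) for env in costs[plan])


def min_regret_plan(costs):
    """Plan whose largest regret is smallest."""
    return min(costs, key=lambda p: max_regret(costs, p))


def min_worst_case_plan(costs):
    """Plan whose worst-case cost is smallest (the game-based criterion)."""
    return min(costs, key=lambda p: max(costs[p].values()))


# Hypothetical costs[plan][environment]: exploring pays off only if the door is open.
COSTS = {
    "explore": {"door_open": 4, "door_closed": 12},
    "detour": {"door_open": 10, "door_closed": 10},
}

if __name__ == "__main__":
    for plan in COSTS:
        print(plan, "max regret:", max_regret(COSTS, plan))
    print("min-regret plan:", min_regret_plan(COSTS))
    print("min worst-case plan:", min_worst_case_plan(COSTS))
```

In this table the worst-case criterion prefers the detour (cost 10 versus 12), while the regret criterion prefers exploring, since exploring loses at most 2 against the best response whereas the detour can lose 6.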


Model-Based Imitation Learning Using Entropy Regularization of Model and Policy

arXiv.org Artificial Intelligence

Approaches based on generative adversarial networks for imitation learning are promising because they are sample efficient in terms of expert demonstrations. However, training a generator requires many interactions with the actual environment because model-free reinforcement learning is adopted to update a policy. To improve the sample efficiency using model-based reinforcement learning, we propose model-based Entropy-Regularized Imitation Learning (MB-ERIL) under the entropy-regularized Markov decision process to reduce the number of interactions with the actual environment. MB-ERIL uses two discriminators. A policy discriminator distinguishes the actions generated by a robot from expert ones, and a model discriminator distinguishes the counterfactual state transitions generated by the model from the actual ones. We derive structured discriminators so that the learning of the policy and the model is efficient. Computer simulations and real robot experiments show that MB-ERIL achieves a competitive performance and significantly improves the sample efficiency compared to baseline methods.